An Analysis of the #FeesMustFall Movement through Twitter Data

Important notebook instructions

Users can run all cells in this notebook to obtain the desired outputs. Visualisations created with matplotlib will appear immediately; however, to reproduce the interactive plots, the notebook must be run on a local machine. Before proceeding, two modules need to be installed from the command line:

  1. Plotly (pip install plotly==4.5.0 or conda install -c plotly plotly=4.5.0)
  2. WordCloud (pip install wordcloud or conda install -c conda-forge wordcloud)

Moreover, due to the large dataset, several processes took hours to run. In these cases, I have saved the required objects to the 'pickle_files' folder and reloaded them when required. The original code is included, but has been wrapped in triple-quoted strings (referred to as docstrings in the comments) so that it will not be executed.
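The save-and-reload pattern described above can be sketched as follows. Note that `cached` and the file names here are illustrative helpers, not objects from this notebook:

```python
import os
import pickle
import tempfile

def cached(path, compute):
    # Load the pickled object if it already exists on disk;
    # otherwise compute it once, save it, and return it
    if os.path.exists(path):
        with open(path, 'rb') as handle:
            return pickle.load(handle)
    result = compute()
    with open(path, 'wb') as handle:
        pickle.dump(result, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return result

path = os.path.join(tempfile.mkdtemp(), 'demo.pickle')
first = cached(path, lambda: [i * i for i in range(5)])  # computed and pickled
second = cached(path, lambda: None)                      # reloaded from disk
```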

Running all cells should take around 1 minute 15 seconds in the notebook's current form.

In [1]:
# Import necessary modules
# Some of these modules will require installation from the command line (e.g. dash, vaderSentiment etc.)

import numpy as np
import pandas as pd
import glob
import datetime as dt
import math
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from pandas.plotting import register_matplotlib_converters
import seaborn as sns
import time
import random
import json
import pickle
import csv
import networkx as nx
from scipy import stats
from operator import itemgetter
from IPython.display import display, HTML
from PIL import Image # PIL's Image is the one used below (Image.open) - do not also import IPython's Image, which it would shadow
import re 
from sklearn.metrics import roc_curve, auc 

# The modules below require installation from the command line 
from wordcloud import WordCloud, STOPWORDS
import plotly.graph_objs as go 
import plotly.express as px 
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode 

# The modules below are required to run code that is currently within docstrings - if you would
# like to run this code, these modules must be installed from the command line 
'''
import tweepy 
from langdetect import detect
from textblob import TextBlob 
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
'''

%matplotlib inline
init_notebook_mode()

# Allows side by side dataframe/images in the specified cell
CSS = """
div.cell:nth-child(8) .output {
    flex-direction: row;
}
"""

HTML('<style>{}</style>'.format(CSS))
Out[1]:

Introduction

#FeesMustFall is a South African student-led protest movement born out of tuition fees that were excessively high relative to average national income, making higher education inaccessible to most of the population.

During my first year of university in Cape Town, I distinctly remember the sound of gunshots and stun grenades from outside the doors of our residence hall. These were the measures taken by the South African Police Service (SAPS) to control increasingly violent protests. Having been part of this defining moment in South African education motivated me to analyse the protests from a statistical perspective.

The protests began in mid-October 2015 and focused on attaining two primary goals:

  1. Abolish annual increases in tertiary tuition fees.
  2. Increase government funding of universities.

Protests started at the University of the Witwatersrand and spread to the University of Cape Town and Rhodes University. Within several weeks, protests were country-wide, ultimately resulting in a national education crisis with an estimated cost of $44.25 million in property damage alone. Images of protests at two universities during the crisis are shown below.

Having lived among the leaders of the #FeesMustFall movement in Leo Marquard Hall, I recall Twitter playing a major role in mobilising the youth, co-ordinating protests and providing a platform for debate surrounding the topic. This notebook delves deeper into the events of the 2015/2016 education crisis from a data perspective.

Images: University of Cape Town (left) and Wits University (right)

Research Questions/Objectives

Research Questions

  1. Is there a correlation between Twitter activity and protest action?
  2. Who are the influential participants and mobilisers in the protests? Is there a high level of interaction between these users/institutions and can any clusters be identified?
  3. What was the general sentiment surrounding the protests?

Objectives

  1. Provide insight into the volume of tweets with the #FeesMustFall tag over time. Gain a deeper understanding of how the movement gained traction by linking tweet volume to significant events. Delve deeper into the link between Twitter activity and protest action.
  2. Examine the network of tweeters and clusters underlying the core data. Identify key institutions, their links and their influence on the movement.
  3. Perform sentiment analysis on the tweets to determine public perception of the protests.

Dataset

The data to be analysed in this notebook consists of three datasets (drawn from two sources):

  1. Core data: consists of tweets containing the tag #FeesMustFall from the period 21/03/2015 - 31/10/2016, scraped from the command line.
  2. Benchmark data: consists of tweets (from 01/02/2017 onwards) from users who posted at least once with the #FeesMustFall tag. Data was obtained using the Twitter API.
  3. Network data: consists of a list of nodes and edges. This data is primarily obtained by manipulating and shaping the core data.

The table below contains information on the core and benchmark data. Variables removed during the cleaning phase are not included.

| Dataset   | Tweet count (before clean) | Tweet count (after clean) | Variables |
|-----------|----------------------------|---------------------------|-----------|
| Core      | 447204                     | 352841                    | date, username, replies, retweets, favorites, text, mentions, hashtags, permalink |
| Benchmark | 112223                     | 27342                     | created_at, text, favorites, retweets, in_reply_to |

Core and benchmark data tweet volume will be used to investigate whether a correlation between Twitter activity and protest action exists. Next, the core data will be aggregated to gain further insight into the movement's key participants. Lastly, the tweet text from the core data is analysed to ascertain whether attitudes towards the movement were positive or negative.

While the qualitative nature of the data makes descriptive statistics less useful, a summary table and plot are presented below to give a better idea of the volume of tweets in the dataset. Each data point is the tweet frequency for one hour of the day. On average, the dataset contains around 15000 tweets for every hour of the day, enough data to draw insightful inferences.
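The per-hour-of-day counts behind these summary statistics can be computed with a pandas groupby on the hour component of the timestamp. The timestamps below are synthetic stand-ins for the real data:

```python
import pandas as pd

# Synthetic timestamps standing in for the tweet 'date' column
timestamps = pd.to_datetime([
    '2015-10-23 11:05', '2015-10-23 11:47',
    '2015-10-23 12:01', '2015-10-24 11:30',
])
demo = pd.DataFrame({'date': timestamps})

# One data point per hour of the day: tweets whose timestamp falls in that
# hour, summed across the whole date range
hourly = demo.groupby(demo['date'].dt.hour).size()

# hourly.describe() would then yield a count/mean/quartile table of the kind shown above
```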

In [2]:
sum_stats = pd.read_pickle('pickle_files/sum_stats')
display(sum_stats)
boxplot = Image.open('images/summary_boxplot.png')
boxplot.thumbnail((400,800))
display(boxplot)

CSS = """
div.cell:nth-child(33) .output {
    flex-direction: row;
}
"""

HTML('<style>{}</style>'.format(CSS))
Value
Number of data points (Hours in day) 24.000000
Average Tweets per hour 14701.708333
Standard Deviation 9170.805064
Min Tweets per hour 1621.000000
Lower Quartile (25%) 7396.250000
Median (50%) 14965.000000
Upper Quartile (75%) 21703.750000
Max Tweets per hour 31402.000000
Out[2]:

Scraping the data

This section includes explanations and code with details on how the various datasets were scraped.

Core data (using the command line)

Twitter has restricted free developer accounts from accessing tweets (by text and tags) more than 7 days in the past. Moreover, the paid API service limits users to 100 daily tweets. These limitations are significant when attempting to perform analysis on more than 350000 tweets. For this reason, alternative scraping methods were utilised to obtain the core data. Marquisvictor's GetOldTweets repository is used to scrape all tweets from the command line. In essence, this algorithm automates manual scrolling through Twitter: given the required information, it scrapes Twitter data and metadata directly from the browser.

After installing the relevant packages, all tweets with the #FeesMustFall tag are saved into 4 csv files. Varying periods are used for each csv file to ensure that the scrape is not large enough to lead to a termination of the request. This procedure is carried out as indicated in the image below. Scraping the entire date range took approximately 21 hours.

Method used to scrape all historical tweets

In [3]:
# Read core data into memory from the scraped csv files

path = r'../GregAdamMeyer' 
all_files = glob.glob(path + "/*.csv")

csv_li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    csv_li.append(df)

df = pd.concat(csv_li, axis=0, ignore_index=False)
df.head(3)
Out[3]:
date username to replies retweets favorites text geo mentions hashtags id permalink
0 31/10/2016 23:42 LesIzmoreKC NaN 0 1 0 Seven university protests around the world tha... NaN @thedailyvox #FeesMustFall 7.930000e+17 https://twitter.com/LesIzmoreKC/status/7932366...
1 31/10/2016 23:40 SupremeCFC NaN 2 0 1 So now at CUT every student qualifies to write... NaN NaN #FeesMustFall 7.930000e+17 https://twitter.com/SupremeCFC/status/79323619...
2 31/10/2016 23:40 camaripop NaN 0 0 0 Thought Factory: #FeesMustFall: So Why Shutdow... NaN NaN #FeesMustFall #SouthAfrica #BLM #leadership #i... 7.930000e+17 https://twitter.com/camaripop/status/793236161...

Benchmark/auxiliary data (using the Twitter API)

     (hereafter referred to as the benchmark data)

This notebook proceeds to use the Twitter API to scrape tweets from random users who tweeted with the #FeesMustFall tag in the data above, in order to draw conclusions about the engagement surrounding #FeesMustFall. Unlike the core data, tweets older than 7 days can be scraped using the API, provided the search parameter is the user's screen name and not the tweet text. During the cleaning phase, I only retain tweets from 4 months after the protests onwards, to ensure this data is not related to #FeesMustFall and is hence a good benchmark against which to compare the core data. Scraping the data below took approximately 1 hour.

In [4]:
# Scrape auxilary data from Twitter API and read it into memory

# Load credentials from json file
with open("../GregAdamMeyer/twitter_api.json", 
          "r") as file:
    secrets = json.load(file)

api_key = secrets['CONSUMER_KEY']
api_secret_key = secrets['CONSUMER_SECRET']
access_token = secrets['ACCESS_TOKEN']
access_token_secret = secrets['ACCESS_SECRET']

'''
auth = tweepy.OAuthHandler(api_key, api_secret_key)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
'''

def get_tweets(screen_name):
    tweets = []
    # Initial request (the API returns at most 200 tweets per call and
    # roughly the 3200 most recent tweets per user in total)
    new_tweets = api.user_timeline(screen_name=screen_name, count=200)
    tweets.extend(new_tweets)
    if not tweets: # user has no tweets - nothing to paginate or save
        return
    # Save the id of the oldest tweet less one to avoid duplication
    oldest = tweets[-1].id - 1
    # Extract tweets until there are none left
    while len(new_tweets) > 0:
        new_tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest)
        tweets.extend(new_tweets)
        oldest = tweets[-1].id - 1
    # Transform array into a format that will be written to a csv file
    outtweets = [[tweet.created_at, tweet.text, tweet.favorite_count, tweet.retweet_count, 
                  tweet.in_reply_to_screen_name] for tweet in tweets]
    # Write to csv
    with open('../2019mt-st445-project-GregAdamMeyer/random_users/' 
            + '%s_tweets.csv' % screen_name, 'w') as f:
        writer = csv.writer(f)
        writer.writerow(["created_at","text","favorites","retweets",
                         "in_reply_to"])
        writer.writerows(outtweets)
        

# Obtain a list of unique users who tweeted with the #FeesMustFall tag
users = set(df['username'])        

# Take 50 of the users above, get their most recent tweets and save them to csv files
'''
for count, user in enumerate(users):
    get_tweets(user)
    if count == 50:
        break 
'''

# Read the benchmark data into memory from the csv files
path = r'../GregAdamMeyer/random_users' 
user_files = glob.glob(path + "/*.csv")

csv_l = []

for user_file in user_files: # avoid reusing the name 'users' from above
    df_rnd = pd.read_csv(user_file, index_col=None, header=0)
    csv_l.append(df_rnd)

df_rnd = pd.concat(csv_l, axis=0, ignore_index=False)
df_rnd.head(3)
Out[4]:
created_at text favorites retweets in_reply_to
0 2020-01-05 19:32:58 @ThatoMamathuba @tds122 Awwww Tony 💓❤ LOL we a... 2 0 ThatoMamathuba
1 2020-01-05 19:27:56 @tds122 @ThatoMamathuba Yeah versus Eng it's 3... 2 0 tds122
2 2020-01-05 19:23:45 @juliantsepo You do understand humour, right? 1 0 juliantsepo

Cleaning the data

Core data

The following steps are carried out during the initial data clean:

  1. Drop all duplicate rows that arise as a result of an overlap in web scrape dates.
  2. Convert dates to datetime objects.
  3. Remove all replies - for most of this analysis, I am only interested in tweets. I store replies in an alternative dataframe which will be used for engagement analysis.
  4. Drop unwanted columns
    • Since a user can optionally link their tweet to a geographic location, selecting tweets based on geotags significantly decreases the number of extracted tweets leading to poorer data quality. For this reason, geographical analysis does not form part of this investigation.
    • Some preliminary analysis has shown that the scraping produced several tweets with the same ID - for this reason I drop the ID column and reference each tweet by an index number, sorted in chronological order.

Different versions of the core dataframe are required for different visualisations of the data. The initial data clean is done in this section, and further ad hoc data pivots, groupbys and other cleaning methods are performed throughout the notebook.

In [5]:
# Clean core data

# First drop the duplicates in each scraped csv due to overlapping scrape dates
# Subset by permalink as these links will be unique
df.drop_duplicates(keep = 'first', inplace = True, subset = 'permalink')
df['date'] = pd.to_datetime(df['date'], dayfirst = True)
df_replies = df[~df['to'].isnull()] # keep the replies in a different dataframe
df = df[df['to'].isnull()] # only want tweets that aren't replies
df.drop(columns=['to', 'geo', 'id'], inplace = True) # drop unwanted columns
df.sort_values(by = 'date' ,inplace = True)
df = df.reset_index()
df.drop(columns='index', inplace = True) # drop extra index column
display(df.head(3))
# df will remain unchanged and act as the core data - manipulations and pivots will be 
# performed on copies of this dataframe
date username replies retweets favorites text mentions hashtags permalink
0 2015-03-21 14:28:00 SkhumbuzoTuswa 2 3 6 Priorities?? #FeesMustFall RT @informer_sa: UC... @informer_sa #FeesMustFall #RhodesMustFall https://twitter.com/SkhumbuzoTuswa/status/5792...
1 2015-04-07 04:31:00 SSSIBIYA 0 1 3 #FEESMustFall that will make sense to me. Free... NaN #FEESMustFall https://twitter.com/SSSIBIYA/status/5852989398...
2 2015-10-13 17:07:00 SmartBlackZA 0 6 1 Now it is time for me to mobilize Wits student... NaN #FeesMustFall https://twitter.com/SmartBlackZA/status/653980...

Benchmark data

The benchmark data is cleaned in a similar manner to the core data. However, due to the contrasting output format as a result of using the API, as well as the purpose for which this data is to be used, two additional steps are taken:

  1. I only retain tweets occurring from February 2017 onwards to ensure that most tweets in this dataset are not linked to #FeesMustFall. This allows for comparison of the engagement distribution over time between #FeesMustFall tweets vs non #FeesMustFall tweets.
  2. The scraped dataset contains retweets - these tweets are removed to focus on the analysis of original tweets. Retweets distort temporality and volume analysis.
In [6]:
# Clean benchmark tweet data

df_rnd['created_at'] = pd.to_datetime(df_rnd['created_at'])
df_rnd.head(15)
# Only want tweets from well after the protest to ensure that the bulk of 
# these tweets are not related to the #FeesMustFall topic - this will allow
# for a more accurate comparison 
df_rnd = df_rnd[df_rnd['created_at'] > dt.datetime(2017,2,1,0,0,0)]
df_rnd.drop_duplicates(inplace = True)
df_rnd = df_rnd[df_rnd['in_reply_to'].isnull()] # don't want tweets that are replies
# Dataset contains retweets - need to remove retweets to obtain only user original tweets
df_rnd = df_rnd.reset_index()
# Create a column with the first two letters of the tweet
df_rnd['first_2_letters'] = df_rnd['text'].astype(str).str[0:2]
# Remove retweets 
df_rnd = df_rnd[df_rnd['first_2_letters']!='RT']
df_rnd = df_rnd.reset_index()
df_rnd.drop(columns=['index', 'first_2_letters', 'level_0'], inplace = True)
df_rnd.tail(3)
Out[6]:
created_at text favorites retweets in_reply_to
27339 2017-02-03 12:18:00 one person followed me and one person unfollow... 0 0 NaN
27340 2017-02-03 10:54:08 #Droogte: Dis mos nou hoe jy #reën vier! Boere... 0 0 NaN
27341 2017-02-01 09:25:36 #RareKanker: Bloem-vrou sterf ná stryd teen ra... 0 0 NaN

Network data

This notebook proceeds to shape the core data for network analysis. First, an undirected network is created, represented as a list in which each element is a set of two connected nodes. Nodes/users are considered connected in a #FeesMustFall context if they have interacted via mentions or replies. A network with 112011 edges is obtained.
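The set-based edge representation can be illustrated with toy usernames: because `{A, B} == {B, A}`, duplicate interactions in either direction collapse to a single edge, and self-mentions shrink to one-element sets that are easy to filter out.

```python
# Toy interactions: (tweeting user, mentioned/replied-to user)
interactions = [('alice', 'bob'), ('bob', 'alice'), ('alice', 'alice'), ('bob', 'carol')]

edges = []
for tweeter, other in interactions:
    edge = {tweeter, other}   # unordered pair: {a, b} == {b, a}
    if edge not in edges:     # skip duplicate edges in either direction
        edges.append(edge)

# A self-mention collapses to a one-element set, so filtering on len == 2 removes it
edges = [e for e in edges if len(e) == 2]
```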

The following steps were taken to clean and shape the data for network analysis:

  1. Subset df to contain rows where mentions are made.
  2. Cycle through the dataframe created in the previous step (df_nx) and the replies dataframe (df_replies) adding required edges to the ntwrk array. When adding edges from df_nx, the parse_mentions function is used to remove unwanted characters.
In [7]:
# Create network of nodes and edges
# First, create a network consisting of all users 
# Trim the network later on to aid visualisation

'''
# Subset tweets that have mentions
df_nx = df[~df['mentions'].isnull()][['username', 'mentions']]
df_nx.reset_index(inplace = True)
df_nx.drop(columns = 'index', inplace = True)

def parse_mentions(mentions):
    # Parses all the mentions for a specific tweet
    # as a list and removes the @ symbol (docstrings
    # not used here as the entire block of code is
    # wrapped in a docstring)
    m_list = mentions.split()
    for i in range(len(m_list)):
        m_list[i] = m_list[i].lstrip('@')
    
    return m_list

# Cycle through all user/mention combinations and add them
# to the ntwrk list if they are not already there
ntwrk = []
for entry in df_nx.itertuples():
    for each_mention in parse_mentions(entry.mentions):
        n_edge = {entry.username, each_mention}
        if n_edge not in ntwrk:
            ntwrk.append(n_edge)

# Next, cycle through all user replies and add them to the ntwrk
# list if they are not already there
df_replies = df_replies[['username', 'to']]
for entry in df_replies.itertuples():
    n_edge = {entry.username, entry.to}
    if n_edge not in ntwrk:
        ntwrk.append(n_edge)

# Remove people who mention/reply to themselves
ntwrk_no_self_mentioners = []
for i in ntwrk:
    if len(i) == 2:
        ntwrk_no_self_mentioners.append(i)
        
ntwrk = ntwrk_no_self_mentioners

# The ntwrk list takes some time to generate so we just save it 
# as a pickle file - the code below can be deleted if you would
# prefer to run the code

with open('pickle_files/ntwrk.pickle', 'wb') as handle:
    pickle.dump(ntwrk, handle, protocol=pickle.HIGHEST_PROTOCOL)
'''    

with open('pickle_files/ntwrk.pickle', 'rb') as handle:
    ntwrk = pickle.load(handle)

Visualising the data

Objective 1: Determining if Twitter activity is linked to protest action

Unfolding of events

The #FeesMustFall movement kicked off after the announcement that university fees for 2016 would increase by 10.5%. The tweet below summarises how many students felt. Much of the momentum the protests gained can be attributed to these feelings.

Firstly, this section gives a brief overview of the unfolding of events. Secondly, it explores whether the prominence of the #FeesMustFall movement and periods of intense protest were linked to the volume of tweets over time.

In [8]:
# Plot volume of tweets by date

df_man = df.copy() # work on an explicit copy so the core dataframe df stays unchanged
df_man['date'] = df_man['date'].dt.floor('T') # remove seconds - allows for better plot visualisation
df_man['date_temp'] = [i.date() for i in df_man['date']]
volume = (df_man.groupby('date_temp')['username'].count())

register_matplotlib_converters()
fig, ax = plt.subplots(figsize=(18, 6))

plt.plot(volume,'r')
plt.title('Number of #FeesMustFall tweets over time', fontsize = 15)
ax.set_facecolor('whitesmoke')
plt.xlabel('Date')
plt.ylabel('Tweet Frequency')
plt.xlim([dt.date(2015,9,1), dt.date(2016,11,15)])
# Plot dashed lines to show periods of intense protests
plt.axvline(dt.date(2015, 10, 8), color='b', linestyle='dashed', linewidth=1)
plt.axvline(dt.date(2015, 10, 29), color='b', linestyle='dashed', linewidth=1)
plt.axvline(dt.date(2016, 9, 19), color='b', linestyle='dashed', linewidth=1)
plt.axvline(dt.date(2016, 10, 28), color='b', linestyle='dashed', linewidth=1)

plt.show()

The plot above gives initial insight into the periods of peak Twitter activity. The dashed blue lines represent the start and end dates of heightened protests for each respective year. The link between Twitter activity and protest actions is explored further in the next plot.

In [9]:
# Create interactive volume plot

volume_df = pd.DataFrame(volume)
volume_df = volume_df.rename({'username': 'tweet_count'} ,axis = 'columns')
volume_df['date'] = volume_df.index
volume_df['year'] = volume_df['date'].map(lambda x: x.year)
volume_df['month'] = volume_df['date'].map(lambda x: x.month)

# We now create the points for the significant dates to be marked on the plotly line graph
signif_dates = [dt.date(2015,9,14),dt.date(2015,10,8),dt.date(2015,10,14),
                dt.date(2015,10,19),dt.date(2015,10,20),dt.date(2015,10,21),
                dt.date(2015,10,23),dt.date(2015,11,1),dt.date(2016,9,19),
                dt.date(2016,10,10),dt.date(2016,10,19),dt.date(2016,11,1)]
for day in signif_dates: # add significant dates with zero tweet count to dataframe
    if day not in volume_df.index:
        volume_df.loc[day] = [0, day, day.year, day.month]
volume_df.sort_values(by = 'date' ,inplace = True)

scatter_df = volume_df[volume_df.index.isin(signif_dates)]


fig = px.line(volume_df, x = volume_df.index, y = 'tweet_count', 
              title='Number of #FeesMustFall tweets over time',
              color = 'year', hover_data = ['tweet_count'],
              labels = {'year':'Year', 'tweet_count': 'Tweet Count',
                                       'x':'Date'})

actual_text = ['<br>Signs of student unrest in KZN</br>',
               '<br>First major protest </br> Presidential task team announced',
               '<br>Birth of the #FeesMustFall movement </br> Systematic shutdown of \
               South African universities <br>Protests escalate</br>',
               '<br>Protests are countrywide</br>Courts grant interdicts against students <br>\
               Police and students engage in violent confrontations</br>',
               '<br>6% fee increase announced </br>Students reject this proposition',
               '<br>Parliamentary grounds are breached </br>Students controlled with stun grenades,\
               teargas and riot shields<br>Multiple arrests take place</br>',
               '<br>0% fee increase announced </br>Protests reach their peak <br>Riots in Pretoria\
               - police vehicles are burned</br>',
               '<br>Protests start losing momentum </br>Lectures take place online\
               <br>Students given the option to defer exams</br>',
               '<br>8% fee increase for 2016 announced </br>Movement regains momentum\
               <br> Universities shut down and move to online lectures </br>',
               '<br>Rubber bullets, stun grenades and smoke grenades fired at students\
               attempting to enter the Great Hall at Wits</br>',
               '<br>Two security guards beaten by students at UCT </br>Private security\
               companies contracted to protect the campus',
               '<br>Steel structure exam venue constructed on the UCT rugby field </br>\
               Exam venue secured by private security and canine units']
hovertext = ['<b>' + str(day) + '</b>' + actual_text[i] for i, day in enumerate(signif_dates)]

fig.add_trace(go.Scatter(
    x=scatter_df.index,
    y=scatter_df.tweet_count,
    mode='markers',
    name='markers',
    hovertext=hovertext,
    hoverinfo="text",
    marker=dict(
        color="green",
        size = [10]*12
    ),
    showlegend=False
))

fig.update_layout(xaxis_title="Date", yaxis_title="Tweet Frequency",
                 xaxis_range=[dt.date(2015,9,1), dt.date(2016,11,15)],
                 xaxis_rangeslider_visible=True)

fig.show()

Given information surrounding pivotal events in the movement, it becomes clear that there is a strong correlation between tweet frequency and protest action. Hover over the green circles above for information on events occurring on specific dates. You can also use the range slider at the bottom to restrict the dates in view.

The significant events corresponding to the green circles above are described in greater detail below:

  1. 14 September 2015: The first protests voicing student dissatisfaction with high fees occur at UKZN. Students set fire to buildings and engage in battles with police and security. The #FeesMustFall tag is yet to be created.
  2. 8 October 2015: The first major protest against tuition fee increases - the president is now involved. He announces a task team to investigate solutions.
  3. 14 October 2015: Escalation of protests - the start of a systematic shutdown of universities across South Africa. Entrances to universities are blocked. #FeesMustFall is announced as a formal movement on this day.
  4. 15-18 October 2015: Protests continue to escalate - the movement is gaining traction.
  5. 19 October 2015: Protests spread to all parts of the country - court interdicts are obtained authorising police to engage with students on campus.
  6. 20 October 2015: Minister of higher education, Nzimande, announces a cap of 6% on fee increases - students reject this.
  7. 21 October 2015: Chaos erupts - 5000 UCT and CPUT students break into the national parliamentary grounds, resulting in violent engagements with police. Students are removed with stun grenades, riot shields and batons. Multiple arrests take place. Similar scenes occur at NMMU in Port Elizabeth, where students are pushed back with rubber bullets and teargas.
  8. 23 October 2015: Announcement of a 0% fee increase by the president in Pretoria. The crowd becomes angered that the president does not descend to the grounds to address them. Riots, burning of police vehicles, rubber bullets, teargas and flash-bangs ensue.
  9. November 2015: Protests slowly lose momentum after the 0% fee increase announcement. Classes at university do not resume and students are lectured online. Exams proceed under heightened security. Students are given the option to defer exams to the following year.
  10. 19 September 2016: Protests continue into 2016, but with less momentum than in 2015, as can be seen from the significantly lower tweet volume. The movement regains momentum when an 8% fee increase is announced. Universities shut down and shift to an online learning platform.
  11. 10 October 2016: Police fire rubber bullets, stun grenades and smoke grenades at students attempting to gain access into the Wits Great Hall.
  12. 19 October 2016: Two security guards are beaten with steel rods taped with masking tape at UCT. Private security companies are contracted to control protests on campus.
  13. November 2016: A steel structure is constructed on the UCT rugby field, allowing security to successfully lock down the exam venue. The venue is heavily secured by armed specialist security forces and canine units.

From the above, it is clear that a spike in tweets with the tag #FeesMustFall coincided with periods of heavy protest action and defining events. The strong link between Twitter activity and protest action is apparent.

*University abbreviations used above:
UKZN - University of KwaZulu-Natal (Durban)
UCT - University of Cape Town (Cape Town)
Wits - University of the Witwatersrand (Johannesburg)
CPUT - Cape Peninsula University of Technology (Cape Town)
NMMU - Nelson Mandela Metropolitan University (Port Elizabeth)

Twitter as a tool to mobilise protestors

In [10]:
# Plot tweet volume by time of day

# Group tweets by the time of day they were tweeted
df_man['time_temp'] = [i.time() for i in df_man['date']]
volume_time = df_man.groupby('time_temp').size()

# Create a list of the hourly count of tweets over the entire date range
# (this assumes every minute of the day appears in volume_time, which holds for
# a dataset of this size; otherwise an hour boundary would be silently skipped)
hourly_count = []
freq = 0
for tweet_time, frequency in zip(volume_time.index, volume_time):
    freq += frequency
    if tweet_time.minute == 59: # minute 59 closes out the hour
        hourly_count.append(freq)
        freq = 0

# Create a simple list of date time objects for every hour
hour_list = []
for hour in range(24):
    hour_list.append(dt.time(hour, 0, 0))
    
fig, ax = plt.subplots(figsize=(18, 6))    

ax.plot(hour_list, hourly_count, 'royalblue')
ax.set_xlabel("Time of day")
ax.set_ylabel("Tweet Frequency (by hour)")
ax.set_title("Tweet frequency across different times of the day")
ax.set_facecolor('whitesmoke')


ax2 = ax.twinx()
ax2.plot(volume_time, 'g')
ax2.set_ylabel('Tweet Frequency (by minute)')
ax2.set_xticks([dt.time(hour, 0, 0) for hour in range(24)])

green_patch = mpatches.Patch(color='g', label='Minutely frequency')
r_blue_patch = mpatches.Patch(color='royalblue', label='Hourly frequency')
plt.legend(handles=[r_blue_patch, green_patch], loc = 'upper left')

plt.show()

# ====================================================================================

# Create summary table and boxplot to be shown as descriptive statistics (output of
# this code is presented earlier in the notebook)

hourly_count = pd.DataFrame(hourly_count)
hourly_count = hourly_count.rename({0: 'Value'} ,axis = 'columns')

sum_stats = hourly_count.describe()

sum_stats.rename({'count': 'Number of data points (Hours in day)',
                  'mean': 'Average Tweets per hour',
                 'std': 'Standard Deviation', 'min': 'Min Tweets per hour',
                 '25%': 'Lower Quartile (25%)', '50%': 'Median (50%)', 
                 '75%': 'Upper Quartile (75%)', 'max': 'Max Tweets per hour'},
                 axis='index', inplace = True)
sum_stats.rename({'replies': 'Replies', 'retweets': 'Retweets',
                 'favorites': 'Favourites'},
                 axis='columns', inplace = True)

# Save to pickle file so df can be shown in Dataset chapter
sum_stats.to_pickle('pickle_files/sum_stats')

hourly_count = hourly_count.rename({'Value': 'Hourly Volume'} ,axis = 'columns')

fig, ax = plt.subplots()
sns.boxplot(data=hourly_count[['Hourly Volume']],notch = True,color = 'g',
           saturation = 1)
plt.title('Boxplot showing distribution of hourly tweets')
fig.tight_layout()
plt.savefig('images/summary_boxplot.png', dpi=300)
plt.close(fig)

The plot above illustrates hourly (left y-axis) and minutely (right y-axis) tweet volume. With the knowledge that the majority of protests occurred around midday, the obvious spike in Twitter activity during this time strengthens the claim that protests are heavily linked to Twitter activity.

In [11]:
# Plot tweet volume by engagement type for core data

# Select the engagement columns with a list (double brackets) - tuple-style selection is deprecated
engage = df_man.groupby('time_temp')[['replies', 'retweets', 'favorites']].sum()

labs = ['Replies', 'Retweets', 'Favorites']
plt.figure(figsize=(18,6))
plt.stackplot(engage.index, engage['replies'], engage['retweets'], engage['favorites'], 
              labels = labs, colors = ['blue', 'orangered', 'green'])
plt.legend(fontsize = 12)
plt.xticks(3600*np.arange(0, 26, 2), ('0:00', '2:00', '4:00', '6:00', '8:00', 
                                      '10:00', '12:00', '14:00', '16:00', 
                                      '18:00', '20:00', '22:00', '24:00'))
plt.xlabel('Time of day')
plt.ylabel('Engagement frequency')
plt.title('Engagement type proportions for #FeesMustFall data')

# Plot zoomed figure
engage_zoom = engage[2*60:2*60+30]
sub_axes = plt.axes([.185, .55, .25, .25]) # location on original graph
sub_axes.stackplot(engage_zoom.index, engage_zoom['replies'], engage_zoom['retweets'], 
              engage_zoom['favorites'], labels = labs, colors = ['blue', 'orangered', 'green'])
sub_axes.set_xticks([dt.time(2, 5*minute, 0) for minute in range(7)])
sub_axes.set_xlabel('Time of day')
sub_axes.set_ylabel('Engagement frequency')

plt.show()

# RAM is overloaded - delete these dataframes to improve runtime
del df_man
del volume
del volume_df
del scatter_df

The above plot analyses the split between likes, replies and retweets. Replies are a form of active engagement where the user is looking to engage in discussion whereas likes and retweets function as a means to spread a message or express agreement with the tweet in question.

The plot makes clear that #FeesMustFall related tweets are predominantly engaged with via retweets, reinforcing the active nature of the movement. The zoomed plot in the top-left corner demonstrates that this trend holds even at low-engagement hours of the day.

We proceed to compare the above plot to benchmark data in order to ascertain whether this trend is specific to the #FeesMustFall movement.

In [12]:
# Plot tweet volume by engagement type for benchmark data

# Remove seconds - allows for better plot visualisation
df_rnd['created_at'] = df_rnd['created_at'].dt.floor('T') 
df_rnd['time_temp'] = [i.time() for i in df_rnd['created_at']]
rnd_tweets = df_rnd.groupby('time_temp')[['retweets', 'favorites']].sum()
# Group the stackplot data by hour as a result of the large fluctuation in the minutely data
rnd_tweets['index'] = rnd_tweets.index
rnd_tweets['hour'] = rnd_tweets['index'].apply(lambda x: x.hour)
rnd_tweets = rnd_tweets.groupby('hour')[['retweets', 'favorites']].sum()

rnd_labs = ['Retweets', 'Favorites']
plt.figure(figsize=(18,6))
plt.stackplot(rnd_tweets.index, rnd_tweets['retweets'], 
              rnd_tweets['favorites'], labels = rnd_labs,
             colors = ['orangered', 'green'])
 
# Scale tick positions so the '24:00' label lands on the last hourly index (23)
plt.xticks(0.958*np.arange(0, 26, 2), ('0:00', '2:00', '4:00', '6:00', '8:00', 
                                      '10:00', '12:00', '14:00', '16:00', 
                                      '18:00', '20:00', '22:00', '24:00'))

plt.legend()
plt.xlabel('Time of day')
plt.ylabel('Engagement frequency')
plt.title('Engagement type proportions for benchmark data')

plt.show()

Data in the plot above is taken from a sample of users who actively tweeted about #FeesMustFall. However, the data in question involves these users' tweets from 6 months after the protests until 2020/01/05 ensuring the majority of these tweets are not related to #FeesMustFall. This allows for meaningful comparison. Unfortunately, one requires a premium API subscription to access the reply count (evidenced here) and so replies are excluded as a metric in this plot.

Retweets form a much smaller proportion of engagement in the plot above relative to #FeesMustFall engagement. This suggests the retweet proportion of engagement on the #FeesMustFall topic is abnormally high. This reinforces the notion that the movement was primarily concerned with action, rather than discussion.
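The comparison above boils down to each engagement type's share of the total. A minimal sketch of that calculation, using made-up totals rather than the actual figures from the `engage` and `rnd_tweets` dataframes:

```python
# Hypothetical engagement totals - illustrative only, not the real dataset figures
core = {'replies': 120_000, 'retweets': 610_000, 'favorites': 270_000}
benchmark = {'retweets': 90_000, 'favorites': 310_000}

def shares(counts):
    '''Return each engagement type's share of the total, rounded to 3 d.p.'''
    total = sum(counts.values())
    return {k: round(v / total, 3) for k, v in counts.items()}

print(shares(core))       # retweets dominate the #FeesMustFall-style totals
print(shares(benchmark))  # favourites dominate the benchmark-style totals
```

Comparing the retweet share across the two dictionaries is the quantitative version of the visual comparison between the two stackplots.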

Summary of findings

Analysis in this section suggests Twitter activity was linked to protest action. This is evidenced by:

  1. Higher activity on protest dates
  2. Higher activity during times when protests occur
  3. Retweets (used to mobilise protestors) being the primary form of engagement

Objective 2: Identifying key participants and clusters

Network analysis

This notebook proceeds to analyse interactions, influential users/institutions and clusters. Using the ntwrk dataset, the following key statistics within the network are calculated:

  1. Total nodes and edges: gives an idea of the scope of users and the interconnectedness between them.
  2. Information on the degree of notable nodes: the degree refers to the number of edges a particular node has.
  3. Number of connected components in the graph: this refers to the number of subgraphs in which a path exists between every node.
  4. Largest subgraph: the largest subgraph where a path exists between every node.
  5. Clustering and transitivity: these are both measures of the degree to which nodes tend to cluster together and hence are useful for determining whether users create tightly-knit groups. Clustering places high emphasis on low-degree nodes whereas transitivity emphasises the higher degree nodes. More information on these measures can be found here.

To best visualise the network structure, the network of the largest connected subgraph is depicted. This gives insight into interactions between the most active users during the #FeesMustFall campaign.
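To make the distinction between clustering and transitivity concrete, here is a minimal pure-Python sketch on a toy adjacency list (a triangle with a pendant node, not the #FeesMustFall network); both quantities match what `nx.average_clustering` and `nx.transitivity` would return for the same graph:

```python
from itertools import combinations

# Toy undirected graph: a triangle (0-1-2) with a pendant node 3 attached to 0
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}

def local_clustering(node):
    '''Fraction of a node's neighbour pairs that are themselves connected.'''
    nbrs = adj[node]
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return links / (len(nbrs) * (len(nbrs) - 1) / 2)

# Average clustering weights every node equally, so low-degree nodes count fully
avg_clustering = sum(local_clustering(n) for n in adj) / len(adj)

# Transitivity = 3 * triangles / connected triples, dominated by high-degree nodes
triples = sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())
triangles = sum(1 for u, v, w in combinations(adj, 3)
                if v in adj[u] and w in adj[u] and w in adj[v])

print(avg_clustering)           # (1/3 + 1 + 1 + 0) / 4, roughly 0.583
print(3 * triangles / triples)  # 3 * 1 / 5 = 0.6
```

Because the two measures aggregate the same local structure differently, comparing them (as done for the largest subgraph below) hints at whether clusters form around hubs or among low-degree nodes.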

In [13]:
# Perform network analysis on #FeesMustFall users

'''
# Create graph by adding edges from ntwrk
G = nx.Graph()

for node_1, node_2 in ntwrk:
    G.add_edge(node_1, node_2, weight=1)
       
degrees = [val for (node, val) in G.degree()]
nx.is_connected(G) # graph isn't connected so we look at connected subgraphs
largest_subgraph = max((G.subgraph(c) for c in nx.connected_components(G)), key=len)

# Took 24 hours to run below code because network is large and centrality calculations are 
# computationally heavy
graph_centrality = nx.degree_centrality(largest_subgraph)
max_de = max(graph_centrality.items(), key=itemgetter(1)) # output: eNCA with value of 0.05
graph_closeness = nx.closeness_centrality(largest_subgraph)
max_clo = max(graph_closeness.items(), key=itemgetter(1)) # output: eNCA with value of 0.35
graph_betweenness = nx.betweenness_centrality(largest_subgraph, normalized=True, endpoints=False)
max_bet = max(graph_betweenness.items(), key=itemgetter(1)) # output: eNCA with value of 0.16


network_stats = pd.DataFrame({'Value':[nx.number_of_nodes(G),nx.number_of_edges(G),
                                       np.max(degrees),np.min(degrees),np.mean(degrees),
                                       stats.mode(degrees)[0][0],
                                       nx.number_connected_components(G),
                                       largest_subgraph.number_of_nodes(),
                                       largest_subgraph.number_of_edges(),
                                       nx.average_clustering(largest_subgraph),
                                       nx.transitivity(largest_subgraph),
                                       'eNCA/0.05','eNCA/0.35','eNCA/0.16']},
                             index = ['Number of nodes','Number of edges',
                                     'Max degree','Min degree','Average degree',
                                     'Most frequent degree',
                                      'Number of connected components',
                                     'Number of nodes in largest subgraph',
                                     'Number of edges in largest subgraph',
                                     'Clustering co-efficient (largest subgraph)',
                                     'Transitivity (largest subgraph)',
                                     'Node with highest degree centrality/value',
                                     'Node with highest closeness centrality/value',
                                     'Node with highest betweenness centrality/value'])

with open('pickle_files/network_stats.pickle', 'wb') as handle:
    pickle.dump(network_stats, handle, protocol=pickle.HIGHEST_PROTOCOL)
'''
with open('pickle_files/network_stats.pickle', 'rb') as handle:
    network_stats = pickle.load(handle)

# Plot the network of the largest subgraph
# Position nodes using Fruchterman-Reingold force-directed algorithm.

'''
node_and_degree = largest_subgraph.degree()
colors_central_node = ['red']
central_nodes = ['eNCA']

pos = nx.spring_layout(largest_subgraph, k=0.05)
fig = plt.figure(figsize = (20,20))
nx.draw(largest_subgraph, pos=pos, node_color=range(largest_subgraph.number_of_nodes()), 
        cmap=plt.cm.PiYG, edge_color="black", linewidths=0.3, node_size=60, alpha=0.6, 
        with_labels=False)
nx.draw_networkx_nodes(largest_subgraph, pos=pos, nodelist=central_nodes, node_size=300, 
                       node_color=colors_central_node)

fig.savefig("images/full_net.png", bbox_inches='tight', dpi=600)
'''

display(network_stats)

net = Image.open("images/full_net.png")
net.thumbnail((420,420))
display(net)
Value
Number of nodes 56522
Number of edges 112011
Max degree 2577
Min degree 1
Average degree 3.96345
Most frequent degree 1
Number of connected components 3177
Number of nodes in largest subgraph 49292
Number of edges in largest subgraph 107940
Clustering co-efficient (largest subgraph) 0.0322928
Transitivity (largest subgraph) 0.00506207
Node with highest degree centrality/value eNCA/0.05
Node with highest closeness centrality/value eNCA/0.35
Node with highest betweenness centrality/value eNCA/0.16

The graphic above is a depiction of the largest subgraph centred around the eNCA node - details on the features of the graph are available on the left. Unfortunately, there are too many nodes to be able to visualise the network successfully. Therefore, alternative ways to visualise the network are considered.

This notebook proceeds to focus on the network of influential users in the movement. The motivation behind this is that the clustering co-efficient is larger than the transitivity ratio suggesting the possibility that low-degree nodes form clusters around higher degree nodes. This possibility is explored in the next few cells with a particular focus on accounts affiliated with the respective South African universities.

In [14]:
# Analyse network of influential users

conn_dict = {}
for node_1, node_2 in ntwrk:
    if node_1 in conn_dict:
        conn_dict[node_1]+=1
    else:
        conn_dict[node_1] = 1
    if node_2 in conn_dict:
        conn_dict[node_2]+=1
    else:
        conn_dict[node_2] = 1

# Analyse users with more than 100 connections and then extract key participants
# in the protests who are affiliated to an institution of which most are obtained 
# from this list 
for user in conn_dict:
    if conn_dict[user]>100:
        print(user, end = ' | ')
        
significant_users = ['TuksUPrising', 'FeesMustFall',
                    'RhodesMustFall','WitsFMF','RhodesSRC']
 
pop_ntwrk = []
for node_1, node_2 in ntwrk:
    if node_1 in significant_users or node_2 in significant_users:
            pop_ntwrk.append([node_1, node_2])
            
print('\n\nThe significant users to be analysed are:\n' , significant_users)
WitsUniversity | khayadlanga | djsbu | WitsSRC | UPTuks | IOL | SASCO_Jikelele | SABCNewsOnline | gwalax | vuyanipambo | SAPoliceService | Mngxitama | MyANC_ | TuksUPrising | SakinaKamwendo | Harold_Ferwood | BladeNzimandeMP | RhodesMustFall | simamkeleD | ewnreporter | eNCA | ANN7tv | UCT_news | Julius_S_Malema | EconFreedomZA | Yfm | PresidencyZA | BigDaddyLiberty | Anele_Nzimande | Radio702 | thedailyvox | RediTlhabi | Our_DA | MmusiMaimane | SAPresident | TheCapeArgus | ntsikimazwai | City_Press | raediology | MissMadiba | imanrappetti | moflavadj | FloydShivambu | Netwerk24Berig |  | ewnupdates | YouTube | Netwerk24 | TimesLIVE | chatlas | LirandzuThemba | MbalulaFikile | Lean3JvV | mashiyanef | sizons | GroundUp_News | News24 | MbuyiseniNdlozi | Eusebius | Powerfm987 | pontsho_pilane | Zwelinzima1 | GugsM | ALETTAHA | shaeera_k | WitsPYA | rdm_za | Ulo_Mkat | marrakurru | dailymaverick | JacaNews | ShottaZee | Sentletse | iamSivN | StudentSpaza | TC_Africa | StellenboschUni | ShakaSisulu | CapeTalk | AshrafGarda | mailandguardian | AdHabb | HajraOmarjee | UJAPK_SRC | POWER987News | BantuHolomisa | AmandlaMobi | SAgovnews | UlrichJvV | _cosatu | NamAfricanist | TheCitizen_News | SowetanLIVE | ANCYLhq | Skhumba07 | ThuliMadonsela3 | pierredevos | ferialhaffajee | SbohSibisi | _MagnumOpus | Trevornoah | ukhozi_fm | helenzille | GarethCliff | 702JohnRobbie | DJFreshSA | baddieju | GovernmentZA | SABCiindaba | CassperNyovest | akaworldwide | equal_education | mpuwoa | simphiwedana | Phislash | SABreakingNews | WitsFMF | MapsMaponyane | MTNza | SAfmnews | tsholomash | BDliveSA | NickolausBauer | DripDripSplash1 | ParliamentofRSA | Mbalings | Springboks | Wits_News | africasacountry | noMoreANCcrime | becsplanb | Abramjee | utopianindigent | AdvDali_Mpofu | tumisole | EFFSouthAfrica | ChangeAgentSA | Tau_njipana | Sanchlet | MYANC | jabu_johnson | Adamitv | Clint_ZA | NkululekoMantsh | fabfol1 | ThatDarnKitteh | DrBladeNzimande | ClintLeBruyns | c0nvey | 
NeoMotloung_ | AVoiceOfReason6 | 

The significant users to be analysed are:
 ['TuksUPrising', 'FeesMustFall', 'RhodesMustFall', 'WitsFMF', 'RhodesSRC']
In [ ]:
# Perform network analysis on high profile #FeesMustFall users
 
# Create graph by adding edges from ntwrk
G2 = nx.Graph()

for node_1, node_2 in pop_ntwrk:
    G2.add_edge(node_1, node_2, weight=1)
    
print("The graph has %d nodes with %d edges." % (nx.number_of_nodes(G2), nx.number_of_edges(G2)))

pos = nx.layout.spring_layout(G2)

#Create Edges
edge_trace = go.Scatter(
    x=[],
    y=[],
    line=dict(width=0.5,color='#888'),
    hoverinfo='none',
    mode='lines')

for edge in G2.edges():
    x0, y0 = pos[edge[0]]
    x1, y1 = pos[edge[1]]
    edge_trace['x'] += tuple([x0, x1, None])
    edge_trace['y'] += tuple([y0, y1, None])
    
node_trace = go.Scatter(
    x=[],
    y=[],
    text=[],
    mode='markers',
    hoverinfo='text',
    marker=dict(
        showscale=True,
        colorscale='Electric',
        reversescale=True,
        color=[],
        size=10,
        colorbar=dict(
            thickness=15,
            title='Number of Node Connections',
            xanchor='left',
            titleside='right'
        ),  
        line=dict(width=2)))
for node in G2.nodes():
    x, y = pos[node]
    node_trace['x'] += tuple([x])
    node_trace['y'] += tuple([y])
    
# Add color to node points
for node, adjacencies in enumerate(G2.adjacency()):
    node_trace['marker']['color']+=tuple([len(adjacencies[1])])
    node_info = 'Name: ' + str(adjacencies[0]) + '<br># of connections: '+str(len(adjacencies[1]))
    node_trace['text']+=tuple([node_info])
    
fig = go.Figure(data=[edge_trace, node_trace],
         layout=go.Layout(
            title='Network graph of high profile #FeesMustFall institutions',
            titlefont=dict(size=16),
            showlegend=False,
            hovermode='closest',
            margin=dict(b=20,l=5,r=5,t=40),
            annotations=[ dict(
                showarrow=False,
                xref="paper", yref="paper",
                x=0.005, y=-0.002 ) ],
            xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
            yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)))

print('Hover over the graph below for information on respective nodes.')

fig
The graph has 742 nodes with 804 edges.

From the analysis above, several inferences can be made:

  • The official FeesMustFall account seemed to have little reach. This suggests that affiliation to a particular institution was an important factor in an account's influence.
  • The two accounts with the most reach were RhodesMustFall (the UCT account) and WitsFMF. Most of the protest activity around the country originated from outbursts at these two universities. The large reach of these two accounts partly explains why these two universities were at the forefront of the movement.
  • There is a cluster of users connecting the Wits and UCT accounts suggesting that the two universities co-operated in mobilising students.
  • While still influential, UP (University of Pretoria) and Rhodes University's reach did not extend as far as the aforementioned institutions. They were also more isolated in their efforts to reduce fees, as illustrated by the fact that they have few links to the other universities.

*RhodesMustFall is the UCT account, RhodesSRC is the Rhodes University account

Summary of findings

UCT and Wits were pivotal to the #FeesMustFall movement. There were strong links that tied the two institutions together. UP and Rhodes University were also influential, albeit to a lesser extent.

Objective 3: Determining public perception towards the protests

Text sentiment analysis

TextBlob is built over the NLTK library. The library aids sentiment analysis by:

  1. Tokenising the tweet i.e. splitting the words from the body of the text.
  2. Removing stopwords (words that do not impact sentiment).
  3. Passing tokens through a sentiment classifier - this classifier has been trained on a labelled movie reviews dataset using Naive Bayes.
  4. Assigning a polarity value to each tweet - high values indicate positive sentiment.
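Steps 1 and 2 of the pipeline above can be sketched in plain Python (the stopword set below is an illustrative stand-in, not TextBlob's internals):

```python
import re

# Illustrative stopword set - real libraries ship much larger lists
STOP = {'the', 'a', 'is', 'are', 'to', 'and', 'of'}

def tokenise(text):
    '''Split a tweet into lowercase word tokens (step 1).'''
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stopwords(tokens):
    '''Drop words that carry no sentiment signal (step 2).'''
    return [t for t in tokens if t not in STOP]

tokens = remove_stopwords(tokenise("The fees are a burden to students"))
print(tokens)  # ['fees', 'burden', 'students']
```

The surviving tokens are what steps 3 and 4 would pass to the trained classifier.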

The classifier groups tweets into three classes:

  1. Positive: Tweets that have an optimistic outlook on the #FeesMustFall movement.
  2. Neutral: Tweets that are neither positive nor negative on the topic.
  3. Negative: Tweets surrounding negative sentiment in relation to #FeesMustFall

vaderSentiment (Valence Aware Dictionary and sEntiment Reasoner) is another tool used to classify tweet sentiment. The process followed is similar to the one described above until the final step, where vaderSentiment uses different training data and models to assign sentiment polarity scores. More information can be found here.

In order to analyse the performance of the two models, the sentiment of the first 100 tweets is manually labelled. This is then compared to each model's predicted sentiment. The results are presented in the form of an ROC curve.

Because the sentiment analysis includes three classes, regular binary classification cannot be used. A OneVsAll multiclass classification technique is adopted. This entails plotting an ROC curve for each class where that class is considered the 'positive' class and all other classes are considered the 'negative' class. This technique is applied to both sentiment analysis models after restricting the data to English tweets.
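The one-vs-rest idea can be illustrated with a hand-rolled AUC, computed as the probability that a randomly chosen in-class tweet outscores an out-of-class one; the labels and polarity scores below are made up for illustration:

```python
def one_vs_rest_auc(labels, scores, positive_class):
    '''AUC for one class vs the rest: the fraction of
    (in-class, out-of-class) pairs ranked correctly (ties count 1/2).'''
    pos = [s for l, s in zip(labels, scores) if l == positive_class]
    neg = [s for l, s in zip(labels, scores) if l != positive_class]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = ['Positive', 'Positive', 'Neutral', 'Negative', 'Neutral']
scores = [0.7, 0.2, 0.1, -0.6, 0.0]  # hypothetical polarity scores

print(one_vs_rest_auc(labels, scores, 'Positive'))  # 1.0 - polarity ranks this class well
print(one_vs_rest_auc(labels, scores, 'Neutral'))   # ~0.33 - raw polarity cannot
                                                    # rank the middle class
```

The poor score for the Neutral class previews a point made after the ROC plots: a single polarity threshold cannot isolate the middle class, which sits in a band around zero.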

In [ ]:
# Perform sentiment analysis
# All data is prepared in this cell - plots are completed in later cells
# Checking whether data is English and assigning sentiment polarity scores took
# around 3 hours to run 

def clean_tweet(tweet): 
    '''Cleans tweets by removing links, special characters 
    using regex statements.'''
    return ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", 
                           tweet).split()) 

def is_english(tweet):  
    '''Returns true if the tweet is in English. Returns False if
    the tweet is not English or if it returns an error'''
    try:
        if detect(tweet) == 'en':
            return True
        else:
            return False
    except:
        return False
    
# Create functions for TextBlob sentiment analysis
def get_tweet_sentiment_score(tweet): 
    '''Takes a tweet as an input and returns its sentiment polarity score'''
    analysis = TextBlob(clean_tweet(tweet)) 
    return analysis.sentiment.polarity

def assign_tweet_sentiment(polarity, threshold):
    '''Takes 2 arguments:
    1. The polarity score for a tweet
    2. The threshold above which a tweet is considered to have positive polarity
    Returns the sentiment of the tweet (positive, negative or neutral)'''
    if polarity > threshold: 
        return 'Positive'
    elif polarity == threshold: 
        return 'Neutral'
    else: 
        return 'Negative'
    
# Create function for vaderSentiment analysis
def compound_score(tweet):
    '''Takes a tweet as an input, cleans the tweet and returns
    the vaderSentiment compound polarity score'''
    analyzer = SentimentIntensityAnalyzer()
    tweet = clean_tweet(tweet)
    return analyzer.polarity_scores(tweet)['compound']

'''
# Replicate df to be manipulated
df_man = df.iloc[:]

# Limit the analysis to tweets in English as the TextBlob library was trained on English data
df_man = df_man[df_man.text.apply(is_english)] # 22236 tweets are removed

# Perform TextBlob sentiment analysis
df_man['TB_sentiment_score'] = df_man.text.apply(get_tweet_sentiment_score)

# Perform vaderSentiment analysis
df_man['VS_sentiment_score'] = df_man.text.apply(compound_score)

df_man = df_man.reset_index()
df_man.drop(columns='index', inplace = True)

# Manually attribute sentiments for first 100 tweets to test accuracy
df_man['true_sentiment'] = 'n/a'
positive_pos = [1,3,4,5,6,7,9,10,11,13,15,17,18,
               22,23,24,26,27,28,29,30,31,32,33,35,
               37,40,46,50,55,56,57,58,59,66,
               67,68,69,73,75,77,79,83,
               87,88,89,95,96]
negative_pos = [12,85,98,99,100]
neutral_pos = [0,2,8,14,16,19,20,21,25,
              34,36,38,39,41,42,43,44,45,47,
              48,49,51,52,53,54,60,61,62,63,
              64,65,70,71,72,74,76,78,80,81,82,84,
              86,90,91,92,93,94,97]

# Use .loc with both indexers at once to avoid chained-assignment warnings
df_man.loc[positive_pos, 'true_sentiment'] = 'Positive'
df_man.loc[negative_pos, 'true_sentiment'] = 'Negative'
df_man.loc[neutral_pos, 'true_sentiment'] = 'Neutral'

df_man.to_pickle('pickle_files/df_man')
'''

with open('pickle_files/df_man', 'rb') as handle:
    df_man = pickle.load(handle)

# Display sentiment columns
df_man.iloc[:3,[0,5,9,10,11]]
In [ ]:
# Compute ROC curves for different techniques

classes = ['Positive', 'Negative', 'Neutral']
y_test_score = df_man.loc[:100][['true_sentiment', 'TB_sentiment_score', 'VS_sentiment_score']]

# Compute ROC curve and ROC area for each class (TextBlob)
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in classes:  # we have three classes
    fpr[i], tpr[i], _ = roc_curve(y_test_score.true_sentiment, 
                                  y_test_score.TB_sentiment_score,
                                  pos_label = i)
    roc_auc[i] = auc(fpr[i], tpr[i])


# Plot of a ROC curve for a specific class
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 6))
for i in classes:
    ax1.plot(fpr[i], tpr[i], label=i + ' (area = %0.2f)' % roc_auc[i])
ax1.plot([0, 1], [0, 1], 'k--')
ax1.set_xlim([0.0, 1.0])
ax1.set_ylim([0.0, 1.05])
ax1.set_xlabel('False Positive Rate')
ax1.set_ylabel('True Positive Rate')
ax1.set_title('ROC curve: TextBlob')
ax1.legend(loc="lower right")
    
# Compute ROC curve and ROC area for each class (vaderSentiment)
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in classes:  # we have three classes
    fpr[i], tpr[i], _ = roc_curve(y_test_score.true_sentiment, 
                                  y_test_score.VS_sentiment_score,
                                  pos_label = i)
    roc_auc[i] = auc(fpr[i], tpr[i])
    
for i in classes:
    ax2.plot(fpr[i], tpr[i], label=i + ' (area = %0.2f)' % roc_auc[i])
ax2.plot([0, 1], [0, 1], 'k--')
ax2.set_xlim([0.0, 1.0])
ax2.set_ylim([0.0, 1.05])
ax2.set_xlabel('False Positive Rate')
ax2.set_ylabel('True Positive Rate')
ax2.set_title('ROC curve: vaderSentiment')
ax2.legend(loc="lower right")
    
plt.show()

It is important to note that the neutral ROC curve is non-informative. Because values within a small range around a polarity score of zero are classified as neutral, the threshold restriction would have to be expressed as an absolute value, i.e. $|Threshold|<0.05$. For this reason, an ROC plot is not appropriate for identifying the ideal threshold in this context - the curve is included only for completeness.
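The absolute-band idea can be sketched as a small variant of the threshold-based assignment used earlier (the 0.05 band width is taken from the discussion above; the function name is illustrative):

```python
def assign_with_band(polarity, band=0.05):
    '''Three-way assignment where "neutral" is an absolute band around zero,
    rather than a single cut-off - an illustrative variant of the
    assign_tweet_sentiment function defined earlier.'''
    if abs(polarity) < band:
        return 'Neutral'
    return 'Positive' if polarity > 0 else 'Negative'

print([assign_with_band(p) for p in (0.3, 0.02, -0.02, -0.4)])
# ['Positive', 'Neutral', 'Neutral', 'Negative']
```

Because the neutral region is bounded on both sides, sweeping a single threshold (as an ROC curve does) cannot trace out this classifier's behaviour.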

From the above, it can be seen that both methods perform similarly. Because the TextBlob method performs marginally better on the positive sentiment curve and because it is trained on a larger dataset, we adopt this classifier. The classifier is applied to all tweets - results are illustrated in the following cell.

In [ ]:
# Visualise sentiment split

# Group data to analyse sentiment split
df_man['TB_sentiment_pred'] = df_man.TB_sentiment_score.apply(assign_tweet_sentiment, args = (0,))
sentiment_count = df_man.groupby('TB_sentiment_pred')[['username']].count()
sentiment_count.rename({'username': 'count'},
                 axis='columns', inplace = True)

sent_fig = make_subplots(rows=1, cols=2, specs=[[{'type':'pie'}, {'type':'xy'}]])

colors = ['gold', 'mediumturquoise', 'darkorange']

sent_fig.add_trace(go.Pie(labels=['Negative', 'Neutral', 'Positive'],
                          values=[sent for sent in sentiment_count['count']]),
                   row=1, col=1)
                          
sent_fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=20,
                      marker=dict(colors=colors, line=dict(color='#000000', width=2)))

sent_fig.update_layout(title={'text': '#FeesMustFall Sentiment Split',
                                'y':0.95,
                                'x':0.5,
                                'xanchor': 'center',
                                'yanchor': 'top'})

sent_fig.add_trace(go.Bar(x=['Negative', 'Neutral', 'Positive'],
                          y=[sent for sent in sentiment_count['count']],
                          marker_color=colors, showlegend = False,
                         hoverinfo = 'y'),
                  row=1, col=2) 

sent_fig.show()

The above plots suggest that more than half of all tweets are neutral while positive sentiment towards the movement is almost double that of negative sentiment. Finally, a wordcloud is computed below:

In [ ]:
# Create a string of all words in the tweets

text = " ".join(df_man['text'])

# Remove common, irrelevant words
stopwords = set(STOPWORDS)
more_stopwords = ['twitter','pic','https','ly','ow','FeesMustFall',
                 'co','za','need','bit','say','want','come','fb',
                 'fees','must','fall','RT','instagram','South','Africa']
for each_word in more_stopwords:
    stopwords.add(each_word)

wc = WordCloud(background_color="white", stopwords=stopwords, scale = 4,
               width=2500, height=1000).generate(text)

plt.figure(num=None, figsize=(15, 5), dpi=80)
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

The word cloud above reaffirms the results seen in the split between positive and negative sentiment towards the movement. Words such as 'National Shutdown' and 'Free Education' take up large spaces in the word cloud and are terms likely to be used by those in favour of the movement. Negative words like 'fragmented' appear smaller on the above graphic reinforcing that there is a larger proportion of positive sentiment towards the movement.

Summary of findings

While most of the tweets are neutral commentary on the topic, a larger proportion of the polar tweets encompass a positive outlook on the movement as opposed to a negative one.

Conclusion

From the above analysis, three primary conclusions can be reached:

  1. There is a strong correlation between Twitter activity and protest action evidenced by the spike in activity during times of protest.
  2. Wits and UCT were the driving institutions behind the protests.
  3. While negativity towards the movement existed, a larger proportion of tweeters viewed the movement in a positive light.

Comments on each part

1. Introduction and data description

The introduction is a nice read. Interesting topic with good motivations. Clear explanation of the datasets.

2. Formulation of research question

Very clear formulation of research questions and objectives.

3. Data acquisition

The use of web-scraping and the Twitter API demonstrates strong data-acquisition skills. Collecting the entire dataset must have taken a lot of effort.

4. Data cleaning and reshaping

Very neat code with detailed explanations. The section is well-organized according to different datasets obtained.

5. Visualization

Various kinds of plots as well as a word cloud. The interactive plots are amazing. In addition, findings from each figure/plot are summarized and explained in detail. The ROC plot in code cell [17] could potentially be made better by adjusting the line width.

6. Data modelling

The project includes text sentiment analysis framed as a multi-class classification problem. Multi-class ROC curves are plotted to compare the different classifiers. This is a very clever way to incorporate data modelling into the project.

7. Conclusion and the overall structure

Good conclusions and very nice structure. All the results are explained in detail. The notebook is very easy to access and follow. Thanks for your hard work!